DOCUMENT RESUME

ED 080 312                                        SE 016 110

AUTHOR          Aikenhead, Glen S.
TITLE           A New Methodology for Test Construction in Course
                Evaluation.
INSTITUTION     Saskatchewan Univ., Saskatoon. Dept. of Curriculum
                Studies.
PUB DATE        Mar 73
NOTE            14p.; Paper presented at the annual meeting of the
                National Association for Research in Science Teaching
                (46th, Detroit, Michigan, March 1973)

EDRS PRICE      MF-$0.65 HC-$3.29
DESCRIPTORS     Educational Research; *Evaluation; *Evaluation
                Techniques; *Physics; Science Education; Secondary
                School Science; *Test Construction; *Testing
IDENTIFIERS     Research Reports



ABSTRACT

A new method of constructing tests is presented in
this article for the purpose of developing a test from student
perception of the course. The Test on Understanding Science (TOUS)
and the Science Process Inventory (SPI) were used as sources of
items. A random sub-sample of 921 students, taking both the pretest
and posttest of TOUS and SPI during the 1967-68 Project Physics (PP)
experimental period, served as sources of empirical data. The McNemar
chi square analysis was used to select test items empirically. Every
item was analyzed with respect to the changes in student responses
between the pretest and posttest. The items showing a statistically
significant change in response were combined into a single instrument
called "A Measurement of Knowledge About Science and Scientists
(Project Physics: Form 1)" (KASSPP1). Another independent random
sub-sample of 64 students was tested to describe the statistical
attributes of KASSPP1. Findings showed that KASSPP1 had a greater
predictive validity for PP than either TOUS or SPI. Application of
the present method to formative evaluation was recommended. (CC)







Information and Research Report

Department of Curriculum Studies
College of Education
University of Saskatchewan
Saskatoon, Canada



A NEW METHODOLOGY FOR TEST CONSTRUCTION
IN COURSE EVALUATION



Dr. Glen S. Aikenhead
Assistant Professor
Science Education



Paper presented at the 46th annual meeting of the
National Association for Research in Science Teaching
DETROIT, MICHIGAN
March 26 - 29, 1973



INTRODUCTION

Test validity has always been a prime concern to an evaluator.
Test developers carefully construct or select test items that yield
greatest validity because this quality in an instrument lends credence
to the results that emerge from its use. This report presents a new
method in test construction which augments an instrument's
validity beyond that possible by standard procedures.

VALIDITY

A short delineation of validity serves to clarify the issues
discussed in this report. Validity, "the degree to which a test
is capable of achieving certain aims," may be characterized by the
following four categories: content, predictive, construct, and
concurrent validity.¹

Content validity simply refers to the substance or content
of a test's being representative of the content of the property
being measured. Often, content validity relies on the consistency
between test items and the writings of respected authors. Models
and outlines of a subject field and "expert" judgements are usually
utilized for this purpose.



Predictive validity concerns an instrument's ability to predict
certain observable performances. That is, from a statement of
what a test claims to measure and from a person's score, one can infer
or predict certain behavior. For instance, the College Board Entrance
Examinations predict to some degree a student's success in his first
year of university. In addition, an instrument would be thought to
have high predictive validity if it would distinguish between students
who had undergone a certain treatment and students who had not; for
example, a test purported to measure a student's appreciation of poetry
would have high predictive validity if it could identify those
students who had taken a course in poetry appreciation. A mathematics
test would have no predictive validity if it could not discriminate
between persons who have studied the subject extensively and those who
have not.

A test's construct validity is determined by knowing what
factors (psychological properties) "explain" the variance in the scores
of that test. The constructs in construct validity refer to the
attributes that are supposedly reflected in the test performance.
Factor analysis and canonical correlation are tools in determining
construct validity.² For example, a measure of IQ is a popular
psychological property and researchers are often interested in knowing
to what extent IQ scores "explain" the variance in the scores
of a certain instrument.

Concurrent validity refers to the degree of consistency among
scores of similar tests. That is, a new instrument may be thought to
be valid if it is shown to yield similar results as an established
instrument. A factor analysis with such a reference test is ideal.



TEST CONSTRUCTION METHODOLOGY

Standard procedures in test construction include formulating
a large set of items which appear to test a student's grasp of an
idea, concept or relationship in a given subject area (the learning
one hopes will occur). This is believed to establish content validity.
After a trial run, items are eliminated for various reasons:

(a) poor or ambiguous wording leading to low reliability,

(b) low "point biserial correlation" (a measure of an individual
    item's ability to differentiate between a
    student scoring high on the total test and a student
    scoring low on the total test). It is thought that
    items that differentiate to a high degree are useful
    in augmenting the range of total test scores; that is,
    they increase the ease of ranking students. For instance,
    if there are items for which "poor" students invariably
    do well whereas "better" students do not, then
    these items would be deleted from the test. Also,
    items that are too difficult or too easy would have
    low point biserial correlations.

(c) closely associated with (b), inability to discriminate
    between "experts" and "non-experts" in the field being
    tested.

The "good" items are combined into a test and subjected to further
trials. The elimination process increases the predictive validity
of the instrument. Further analysis would be necessary to establish
construct and concurrent validity if this were considered desirable.
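
As an illustration of criterion (b), the point biserial correlation for a single item can be computed from each student's 0/1 score on the item and total test score. The sketch below is not part of the original paper; the function name `point_biserial`, the use of the population standard deviation, and the convention of returning 0.0 when every student answers the item the same way are assumptions made for the example.

```python
from math import sqrt
from statistics import mean, pstdev


def point_biserial(item_scores, total_scores):
    """Point biserial correlation between one 0/1 item and total test scores.

    item_scores: 1 if the student answered the item correctly, else 0.
    total_scores: the same students' total test scores, in the same order.
    """
    if len(item_scores) != len(total_scores):
        raise ValueError("item_scores and total_scores must have equal length")

    right = [t for i, t in zip(item_scores, total_scores) if i == 1]
    wrong = [t for i, t in zip(item_scores, total_scores) if i == 0]
    sd = pstdev(total_scores)  # population SD (an assumption of this sketch)

    # Assumed convention: an item everyone passes or everyone fails,
    # or a test with no score spread, cannot differentiate students.
    if not right or not wrong or sd == 0:
        return 0.0

    p = len(right) / len(item_scores)  # proportion answering correctly
    return (mean(right) - mean(wrong)) / sd * sqrt(p * (1 - p))
```

An item answered correctly mainly by high scorers gives a value near +1. An item on which "poor" students do well and "better" students do not gives a negative value, which is the case the standard procedure deletes.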

The content of each item of the test corresponds to what the
test writer believes to be the course objectives. That is, each
item reflects the course content as seen through the eyes of the
test constructor. Gains in student achievement on these instruments
only coincide with what the teacher or curriculum developer HOPES
students learn. If students do not gain significantly, one assumes
they have not learned sufficiently. But what is the course content
as seen through the eyes of the student? If students do not gain
significantly, then the hoped-for learning did not take place
because it was not adequately provided.

In evaluating a course, one should ask: What do students
generally learn? Only after further analysis would course objectives
be considered: Which objectives were accomplished and which were not?
What learning - positive and negative - took place that was not
encompassed by the course objectives? (A question all too few
evaluators consider).

The standard procedure for test construction narrowly deals
with hoped-for learning based on a fragile correspondence among course
objectives, course content, student experience in that course, and
the test items' content. An alternative procedure has been developed
that tends to overcome these limitations in the standard method
of test construction.

AN ALTERNATIVE PARADIGM

I propose a new procedure for test development. This alternative
paradigm is quite simple: a test is constructed from the
students' perception of the course, instead of the teacher's or
curriculum developer's vision of the course. This new methodology
gives high predictive validity to the evaluative instrument and,
as demonstrated shortly, yields greater feedback to the teacher.



To develop [illegible] one selects items from [illegible] sources
[illegible] existing validated instruments. [illegible] in scope
[illegible] objectives, assuring high content validity [illegible]
individual course objectives [illegible]. The empirical basis of this
selection rests upon the [illegible] between a pretest and posttest
administration of [illegible] instruments. [illegible] significant
[illegible] of correct responses. [illegible] of evaluation in
education is to [illegible]. The proposed paradigm [illegible], i.e.,
their ability to [illegible] predictive validity for that course
[illegible] predictive validity [illegible] tend to show [illegible]
instruments.




The empirical nature of the item selection emerges from a
statistical item analysis. Every item is analyzed with respect to the
changes in student responses between the pretest and posttest. There
are only two changes possible: (a) from an incorrect response to a
correct response, and (b) from a correct response to an incorrect
response. If no real change has occurred, the probability is .50 that
students who change their response move to a correct answer. McNemar⁴
has derived a chi square test specifically for this situation:

$$\chi^2 = \frac{(A - D)^2}{A + D}$$

with 1 degree of freedom*, where A and D are cell frequencies in the
following contingency table:

                      Item X

                     Posttest
                    0       1

    Pretest   1     A       B
              0     C       D

    1 signifies a correct response
    0 signifies an incorrect response



The chi square analysis determines which items experience a
statistically significant change in student response between the
pretest and posttest. These items may be combined into a single
instrument, a test specifically applicable to a given course. In
addition, one may infer what knowledge students appeared to gain or
lose during the interim.



*When A+D<20, a Yates' correction for small frequencies must be
introduced. The equation becomes: $\chi^2 = (|A - D| - 1)^2 / (A + D)$.
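
The item analysis described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure's original implementation: the function names, the .05 significance level, and the clamping of the corrected numerator at zero are assumptions. The paper specifies only the two chi square formulas and the small-frequency rule for A + D.

```python
from math import erfc, sqrt


def change_counts(pretest, posttest):
    """Count the two kinds of change for one item (1 = correct, 0 = incorrect).

    Returns (A, D): A = correct on pretest, incorrect on posttest;
                    D = incorrect on pretest, correct on posttest.
    """
    pairs = list(zip(pretest, posttest))
    a = sum(1 for pre, post in pairs if pre == 1 and post == 0)
    d = sum(1 for pre, post in pairs if pre == 0 and post == 1)
    return a, d


def mcnemar_chi2(a, d):
    """Return (chi square, p value) for McNemar's test, 1 degree of freedom."""
    n = a + d
    if n == 0:
        return 0.0, 1.0  # no student changed response on this item
    if n < 20:
        # Yates' correction; clamping at zero is an assumption of this sketch.
        chi2 = max(abs(a - d) - 1, 0) ** 2 / n
    else:
        chi2 = (a - d) ** 2 / n
    # Upper tail of the chi square distribution with 1 degree of freedom.
    p = erfc(sqrt(chi2 / 2))
    return chi2, p


def select_items(pre_rows, post_rows, alpha=0.05):
    """Return (item index, direction, chi2, p) for significantly changed items.

    pre_rows and post_rows hold one list of 0/1 item scores per student,
    in the same student order. alpha is an assumed significance level.
    """
    selected = []
    n_items = len(pre_rows[0])
    for j in range(n_items):
        a, d = change_counts([r[j] for r in pre_rows],
                             [r[j] for r in post_rows])
        chi2, p = mcnemar_chi2(a, d)
        if p < alpha:
            direction = "gain" if d > a else "loss"
            selected.append((j, direction, chi2, p))
    return selected
```

For example, `mcnemar_chi2(5, 25)` gives a chi square of about 13.3, well beyond the 3.84 needed for significance at the .05 level with 1 degree of freedom.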




FIELD TRIAL 

Description 

In an experimental study, this new methodology for constructing
tests successfully led to a unique evaluation instrument for the
new physics course by Holton, Rutherford and Watson, Harvard Project
Physics (HPP)*. The investigation used two validated instruments,
the Test on Understanding Science (TOUS)⁵ and the Science Process
Inventory (SPI)⁶, as the original source of items. Many of the
HPP's major objectives appear to be encompassed by the content of
the TOUS and SPI. The HPP student responses to the TOUS and SPI
items (pretest and posttest) yielded the empirical data for this
study.

The students were part of a nation-wide evaluation of HPP.
Fifty-five teachers were selected at random from the population of
American and Canadian physics teachers.⁷ The randomly selected
teachers were again randomly split into two groups: thirty-five
taught HPP while twenty served as a control group teaching their
usual physics courses (non-HPP). The HPP group attended a summer
institute to prepare them to teach the new course. In addition,
there was another group of teachers experienced in teaching HPP.
These nineteen had volunteered to participate in the evaluation
project. They taught in various regions of the United States.
The number of students studying HPP in the evaluation project
totalled 2,950. From this group, 921 students were randomly
chosen to write both the pretest and posttest TOUS or SPI.⁸

*The first commercial edition was published as Project Physics,
September, 1970, Holt, Rinehart & Winston.




Similarly, sixty-four students were randomly selected to write
both tests both times. (The HPP evaluation project used many
other instruments. This randomization reduced the total time
taken by testing.⁹)

Results

The TOUS and SPI supplied 195 items of which 101 items were
selected by the observed significant changes in student knowledge
over a year of studying HPP. This derived HPP test was called
"A Measurement of Knowledge About Science and Scientists (Project
Physics: Form 1)", abbreviated KASSPP1. The test includes ninety-five
items which showed a significant positive gain and six items
which experienced a significant negative gain in student response.
The rationale for including these six items is presented below.

Some quantitative attributes of the KASSPP1 may be found
in Table I.



TABLE I

QUANTITATIVE DATA FOR THE KASSPP1
BASED ON THE RANDOM SUBSAMPLE OF HPP STUDENTS

            Mean                          Pre-post       Reliability(b)
            Score    SD(a)    Range       Correlation    estimate          N

Pretest     70.70    7.403    52 - 82     .76            .79               64
Posttest    78.16    8.663    53 - 95

(a) SD means standard deviation.
(b) Kuder-Richardson formula-20 was used.
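
The reliability estimate in Table I comes from Kuder-Richardson formula 20. A minimal sketch of that formula follows, assuming dichotomously scored (0/1) items and the population variance of total scores; the function name `kr20` and the data layout are illustrative and are not taken from the paper.

```python
from statistics import pvariance


def kr20(responses):
    """Kuder-Richardson formula 20 for a test of 0/1 scored items.

    responses: one list of 0/1 item scores per student, all the same length.
    """
    n_students = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    var_total = pvariance(totals)  # population variance (an assumption)
    if n_items < 2 or var_total == 0:
        raise ValueError("need at least two items and some score variance")

    # Sum over items of p * q, where p is the proportion answering
    # the item correctly and q = 1 - p.
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_students
        sum_pq += p * (1 - p)

    return n_items / (n_items - 1) * (1 - sum_pq / var_total)
```

Using the sample variance instead of the population variance shifts the estimate slightly; which variant the original analysis used is not stated.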




These data were obtained from the random subsample of sixty-four
HPP students. The mean scores lay halfway between a score obtainable
by pure chance (47 points) and a perfect score (101 points).
The standard deviation (7.403 on the pretest) was similar to the
TOUS (7.13)¹⁰ but was not as large as the standard deviation of the
SPI (13.1)¹¹. The range of scores showed a spread of 30 to 40
points. Two different estimates of the test's reliability compare
favorably with the standards established by Davis¹² for measuring
group and individual characteristics. The relationship between
KASSPP1 scores and measures of reading and "mental" ability is
assumed to fall within the range set by the TOUS and SPI (correlation
coefficients of .47 to .66).¹³

In constructing the KASSPP1, primary consideration was given
to its ability to reflect changes in student knowledge, and not to
maximizing its quantitative attributes. Thus, items were included
that showed a negative change between the pretest and posttest. If
the number of items experiencing negative gains was relatively
small, one would expect the KASSPP1 to yield larger gain scores
than either the TOUS or SPI. This expectation was fulfilled by
the results of the independent analysis of a random subsample of
HPP students. This subsample's KASSPP1 gain score (7.46 points)
exceeded its TOUS gain score (2.76 points) and its SPI gain (5.80
points). These results documented the increased predictive validity
of the KASSPP1. With regard to the number of items that
experienced a significant change between the pretest and posttest,
the KASSPP1 also yielded more information than either the TOUS or
SPI. The random subsample of HPP students demonstrated a very large
improvement for 50 TOUS and SPI items, 26% of all 195 items. Equally
large gains were accomplished for 38 KASSPP1 items, 38% of the total
101 items. Thus, the one test (KASSPP1) was able to yield a somewhat
higher proportion (38% compared with 26%) of items on which students
dramatically moved toward the correct response. This result
suggests that the KASSPP1 is more efficient than the TOUS and SPI
in yielding feedback for HPP teachers and students.



IMPLICATIONS FOR FURTHER RESEARCH

The development of the KASSPP1 illustrates a novel method of
test construction: general and valid instruments are utilized by
empirically selecting items which prove to be most appropriate to a
particular course. Many studies may be conducted by employing this
new paradigm. For any curriculum in its early stages of development
the following procedures are suggested.

(a) General tests, such as the TOUS and SPI, would be
    used to observe what knowledge students tended to
    learn when studying a new course. The information
    may lead to alterations or shifts in emphasis in
    the curriculum materials.

(b) The testing would then be repeated for the last
    revision of the curriculum. Tests applicable to
    the new course could then be derived.

(c) The curriculum project's package of evaluation
    instruments would include these derived tests.
    This is especially useful in the case where the
    derived instruments concern knowledge traditionally
    thought to lie beyond the realm of standard subject
    matter; for instance, knowledge closely related
    to students' impressions and attitudes.




Not only did the new method of item selection yield an objective
test about science and scientists, it also supplied sufficient data
for partially evaluating the HPP course.¹⁴ By examining the items
that experienced significant gains or losses, one can recognize
differences between HPP and other physics courses. These differences
were defined in terms of the knowledge students tended to acquire
rather than in terms of differences in student mean scores. Student
achievement was also compared with the objectives of HPP.⁸ Such
analyses and comparisons correspond to major components of formative
evaluation.



REFERENCES

1. "Technical Recommendations for Psychological Tests and Diagnostic
   Techniques." Psychological Bulletin, 51 (Supplement, 1954), 13.

2. Mayo, Samuel T. "The Methodology and Technology of Educational and
   Psychological Testing." Review of Educational Research, 38
   (February, 1968), 92-101.

3. Davis, Frederick B. "Testing and the Use of Test Results." Review
   of Educational Research, 32 (February, 1962), 5-14.

4. McNemar, Quinn. Psychological Statistics. 4th ed. New York: John
   Wiley and Sons, 1969.

5. Cooley, William W., and Klopfer, Leo. Test on Understanding
   Science. Princeton, New Jersey: Educational Testing Service, 1961.

6. Welch, Wayne W. Welch Science Process Inventory, Form D. Cambridge,
   Mass.: Harvard Project Physics, 1966.

7. Welch, Wayne W.; Walberg, Herbert J.; and Ahlgren, Andrew. "The
   Selection of a National Random Sample of Teachers for Experimental
   Curriculum Evaluation." School Science and Mathematics,
   69 (March, 1969), 210-216.

8. Aikenhead, Glen S. "The Measurement of Knowledge About Science
   and Scientists: An Investigation into the Development of
   Instruments for Formative Evaluation." Unpublished Doctoral
   Thesis, Harvard University, 1972. p. 91 & pp. 181-188.

9. Walberg, Herbert J. and Welch, Wayne W. "A New Use of Randomization
   in Experimental Curriculum." School Review (Winter, 1967).

10. Cooley, William W., and Klopfer, Leo. Manual: Test on Understanding
    Science. Princeton, New Jersey: Educational Testing Service,
    1961.

11. Welch, Wayne W. "Welch Science Process Inventory, Form D: Summary
    of Information." 104 Burton Hall, University of Minnesota,
    Minneapolis, Minn.: Dr. Wayne W. Welch. (Mimeographed, no
    date listed.)

12. Davis, Frederick B. Educational Measurements and their Interpretations.
    Belmont, California: Wadsworth, 1964.

13. Aikenhead, Glen S. "The Measurement of High School Students' Knowledge
    About Science and Scientists." Unpublished Qualifying Paper,
    Harvard University, 1970.

14. Aikenhead, Glen S. "The Interpretation of Student Performance on
    Evaluative Tests." Paper presented at the 46th annual
    meeting of the National Association for Research in Science
    Teaching, Detroit, March 26-29, 1973.



